Targeted Gene Metagenomic Data Analysis ◾ 255
how the reads and barcodes are stored. The reads also might be sequenced as single end or
paired end. The paired-end reads are usually stored in two FASTQ files.
7.2.2 Metagenomic Features
After the preprocessing, the raw reads of the targeted gene (16S rRNA) are dereplicated
by reducing the abundance of the similar reads based on a Hamming distance threshold.
The dereplicated reads are then clustered by collapsing the complete sets of reads into a
grouping units called operational taxonomic units (OTUs) [5]. Once the OTUs have been
formed, representative sequences (features) can either be selected from those OTUs or the
sequences of each OTUs are denoised to remove potential errors before selecting represen-
tative sequences. In the following, we will discuss both clustering and denoising.
7.2.2.1 Clustering
There are three approaches for clustering the preprocessed amplicon reads. The first is the
de novo clustering [6] in which the reads are clustered into OTUs according to their simi-
larity at a specified threshold. It compares each read against each other and then it groups
sequences into OTUs. The second is called open-reference clustering in which reads are
clustered around previously annotated reference sequences by sequence classification or
searching in a database (Greengenes database) and the reads that do not cluster around
reference sequences are clustered with de novo method. The third is the closed-reference
clustering in which reads are clustered around reference sequences and the reads that do
not find reference sequences are removed [7]. The closed-reference clustering may intro-
duce reference bias to the clustering process since the reads without references will be
removed, while they may represent novel species or taxa. The de novo clustering is pre-
ferred, although it may be computationally expensive other than the other two clustering
methods. The algorithms for de novo clustering include hierarchical clustering and heu-
ristic clustering. The hierarchical clustering method requires a matrix for the pairwise dis-
tances between all unique sequences. A hierarchical tree is constructed from the distance
matrix, and then the OTUs are constructed from the tree based on a distance threshold.
The heuristic clustering uses pairwise sequence alignment to construct a distance
matrix one by one instead of computing all distances in a single step. The heuristic cluster-
ing method generates OTUs in a greedy incremental strategy that utilizes one sequence as
a seed to represent a cluster. Then, each one of the unique reads is compared with the seeds
of existing clusters. A read is assigned to a cluster if the pairwise distance between a unique
read and a seed meets a predefined threshold. If a read does not join any cluster, that read
will become a seed for a new cluster. The final clusters represent the OTUs. The heuristic
clustering method is more computationally efficient than the hierarchical clustering. There
are a variety of heuristic clustering methods that vary in seed selection and distance cal-
culation. Examples of the tools that use heuristic clustering for constructing OTUs include
USEARCH and CD-HIT.
In general, an OTU constructed by any of the clustering methods is considered as a tax-
onomic group. From this point, the analysis can move on to the taxonomic group assign-
ment, construction of the phylogenetic tree, and diversity analysis but the latest analytic